NSF PAR Search | NSF Public Access Repository


Vulcan: Automatic Query Planning for Live ML Analytics

Zhang, Yiwen; Zhang, Xumiao; Ananthanarayanan, Ganesh; Iyer, Anand; Shu, Yuanchao; Bahl, Victor; Mao, Z Morley; Chowdhury, Mosharaf (April 2024, USENIX NSDI)

Full Text Available
CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving

Liu, Yuhan; Li, Hanchen; Cheng, Yihua; Ray, Siddhant; Huang, Yuyang; Zhang, Qizheng; Du, Kuntai; Yao, Jiayi; Lu, Shan; Ananthanarayanan, Ganesh; et al. (August 2024, Association for Computing Machinery, New York, NY, United States)

As large language models (LLMs) take on complex tasks, their inputs are supplemented with longer contexts that incorporate domain knowledge. Yet using long contexts is challenging, because nothing can be generated until the LLM has processed the whole context. The context-processing delay can be reduced by reusing the KV cache of a context across different inputs, but the KV cache contains large tensors, and fetching it over the network can add substantial delay. CacheGen is a fast context-loading module for LLM systems. First, CacheGen uses a custom tensor encoder that leverages the KV cache's distributional properties to encode a KV cache into more compact bitstream representations with negligible decoding overhead, saving bandwidth. Second, CacheGen adapts the compression level of different parts of a KV cache to cope with changes in available bandwidth, in order to maintain low context-loading delay and high generation quality. We test CacheGen on popular LLMs and datasets. Compared to recent systems that reuse the KV cache, CacheGen reduces the KV cache size by 3.5–4.3x and the total delay in fetching and processing contexts by 3.2–3.7x, with negligible impact on LLM response quality. Our code is at: https://github.com/UChi-JCL/CacheGen.
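The bandwidth-adaptation idea in the abstract above can be illustrated with a minimal sketch. It is not taken from the CacheGen code base: the level names, size ratios, the even split of the delay budget, and the functions `pick_level` and `plan` are all hypothetical, chosen only to show how a per-chunk compression level might be picked from a bandwidth estimate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    name: str
    size_ratio: float  # encoded size / raw size (made-up illustrative values)


# Hypothetical levels, ordered from highest quality (largest) to lowest (smallest).
LEVELS = [Level("high", 0.30), Level("medium", 0.25), Level("low", 0.20)]


def transfer_seconds(raw_bytes, level, bandwidth_bps):
    """Time to send one encoded chunk at the given bandwidth (bits per second)."""
    return raw_bytes * level.size_ratio * 8 / bandwidth_bps


def pick_level(raw_bytes, bandwidth_bps, budget_s):
    """Highest-quality level whose transfer fits the budget, else the smallest one."""
    for level in LEVELS:
        if transfer_seconds(raw_bytes, level, bandwidth_bps) <= budget_s:
            return level
    return LEVELS[-1]


def plan(chunk_bytes, bandwidth_trace_bps, total_budget_s):
    """Pick a level per chunk, sharing the remaining budget evenly among remaining chunks."""
    remaining = total_budget_s
    choices = []
    for i, (size, bw) in enumerate(zip(chunk_bytes, bandwidth_trace_bps)):
        per_chunk = max(remaining, 0.0) / (len(chunk_bytes) - i)
        level = pick_level(size, bw, per_chunk)
        remaining -= transfer_seconds(size, level, bw)
        choices.append(level.name)
    return choices


if __name__ == "__main__":
    # Three 1 MB chunks; bandwidth dips for the second chunk.
    print(plan([1_000_000] * 3, [8_000_000, 2_000_000, 8_000_000], 1.0))
```

In this sketch a bandwidth dip pushes later chunks to smaller encodings, which mirrors the trade-off the abstract describes between context-loading delay and generation quality.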
Full Text Available
CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving

https://doi.org/10.1145/3651890.3672274

Liu, Yuhan; Li, Hanchen; Cheng, Yihua; Ray, Siddhant; Huang, Yuyang; Zhang, Qizheng; Du, Kuntai; Yao, Jiayi; Lu, Shan; Ananthanarayanan, Ganesh; et al. (August 2024, ACM)

Full Text Available
OneAdapt: Fast Adaptation for Deep Learning Applications via Backpropagation

https://doi.org/10.1145/3620678.3624653

Du, Kuntai; Liu, Yuhan; Hao, Yitian; Zhang, Qizheng; Wang, Haodong; Huang, Yuyang; Ananthanarayanan, Ganesh; Jiang, Junchen (October 2023, SoCC '23: Proceedings of the 2023 ACM Symposium on Cloud Computing)
Tambur: Efficient loss recovery for videoconferencing via streaming codes

Rudow, Michael; Yan, Francis Y.; Kumar, Abhishek; Ananthanarayanan, Ganesh; Ellis, Martin; Rashmi, K.V. (April 2023, USENIX Symposium on Networked Systems Design and Implementation)

Full Text Available
RECL: Responsive Resource-Efficient Continuous Learning for Video Analytics

Khani, Mehrdad; Ananthanarayanan, Ganesh; Hsieh, Kevin; Jiang, Junchen; Netravali, Ravi; Shu, Yuanchao; Alizadeh, Mohammad; Bahl, Victor (April 2023, 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23))

Full Text Available
Gemel: Model Merging for Memory-Efficient, Real-Time Video Analytics at the Edge

Padmanabhan, Arthi; Agarwal, Neil; Iyer, Anand; Ananthanarayanan, Ganesh; Shu, Yuanchao; Karianakis, Nikolaos; Xu, Harry; Netravali, Ravi (April 2023, 20th USENIX Symposium on Networked Systems Design and Implementation)

Full Text Available
RECL: Responsive Resource-Efficient Continuous Learning for Video Analytics

Khani, Mehrdad; Ananthanarayanan, Ganesh; Hsieh, Kevin; Jiang, Junchen; Netravali, Ravi; Shu, Yuanchao; Alizadeh, Mohammad; Bahl, Victor (April 2023, USENIX Association)

Full Text Available
Gemel: Model Merging for Memory-Efficient, Real-Time Video Analytics at the Edge

Padmanabhan, Arthi; Agarwal, Neil; Iyer, Anand; Ananthanarayanan, Ganesh; Shu, Yuanchao; Karianakis, Nikolaos; Xu, Guoqing Harry; Netravali, Ravi (January 2023, NSDI)
Towards Memory-Efficient Inference in Edge Video Analytics

Padmanabhan, Arthi; Iyer, Anand; Ananthanarayanan, Ganesh; Shu, Yuanchao; Karianakis, Nikolaos; Xu, Harry; Netravali, Ravi (January 2022, HotEdgeVideo 2021)
Full Text Available
